ABSTRACT

Summary: Current tools for liquid chromatography and mass spectrometry
metabolomic data cover a limited number of processing steps, whereas
online tools are hard to use in a programmable fashion. This article
introduces the Metabolite Automatic Identification Toolkit (MAIT)
package, which makes it possible for users to perform end-to-end
metabolomic liquid chromatography and mass spectrometry data analysis.
MAIT is focused on improving the peak annotation stage and provides
essential tools to validate statistical analysis results. MAIT generates
output files with the statistical results, peak annotation and
metabolite identification.

Availability and implementation:
http://b2slab.upc.edu/software-and-downloads/metabolite-automatic-identification-toolkit/.

Contact: francesc.fernandez.albert@upc.edu 

Supplementary information: Supplementary data are available at 

1 INTRODUCTION

Liquid chromatography and mass spectrometry (LC/MS) is an analytical
technique widely used in metabolomics to detect molecules in biological
samples (Theodoridis et al., 2012). A wide array of software tools is
available for LC/MS profiling data analysis, including commercial,
programmatic and online tools. A commercial example is Analyst®,
whereas some open-source packages permit programmatic processing, such
as the R package XCMS (Smith et al., 2006) for peak detection, or CAMERA
(Kuhl et al., 2012) and AStream (Alonso et al., 2011) for peak
annotation. Other efforts address peak annotation alone using Java
(Brown et al., 2011). MZmine and mzMatch are modular tools written in
Java that focus on LC/MS data preprocessing and visualization
(Katajamaa et al., 2006; Pluskal et al., 2010; Scheltema et al., 2011).
Online tools, such as XCMSOnline (http://xcmsonline.scripps.edu) or
MetaboAnalyst (Xia et al., 2009), permit sample processing through a
web graphical user interface. Refer to Supplementary
Table S1 for a comparison of the capabilities of some of the main
available tools. In this context, we introduce a new R package called
Metabolite Automatic Identification Toolkit (MAIT) for automatic LC/MS
analysis. The goal of the MAIT package is to provide an array of tools
that makes programmable end-to-end metabolomic statistical analysis
possible (see Section 3 of the Supplementary Material for details about
the MAIT modularity). MAIT includes functions to improve peak
annotation through a process called biotransformations and to assess
the predictive power of statistically significant metabolites that
quantify class separability.

2 METHODS 

MAIT comprises four stages: peak detection, peak annotation, statistical
analysis, and table and plot creation (Fig. 1). The peak detection stage
detects the peaks in the LC/MS sample files. The peak annotation stage
improves the identification of the metabolites in the metabolomic samples
by increasing the chemical and biological information in the dataset. The
statistical analysis reveals the significant sample features and measures
their predictive power. MAIT uses the R package XCMS to detect and
align peaks. Peak annotation proceeds in three steps:

• First, MAIT uses the CAMERA package to perform the first annotation
step (Kuhl et al., 2012). In this stage, MAIT uses a peak
correlation distance and a retention time window to find which
peaks come from the same source metabolite. The peaks
within each peak group are annotated following a reference
adduct/fragment table and a mass allowance window.

• Biotransformations could be related to specific in-source mass losses.
Therefore, in the second annotation step, they are detected using a
mass allowance window inside the peak groups (Breitling et al.,
2006). For this search, MAIT already includes a biotransformations
table (here, human biotransformations). User-defined biotransformation
tables can be set as input, following the procedure defined in
Supplementary Text (Section 6.6).

• Finally, a predefined metabolite database is mined for the significant
masses. This step identifies metabolites with the help of the Human
Metabolome Database (Wishart et al., 2009), 2009/07 version.
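
The second annotation step can be illustrated with a minimal base-R
sketch. It is not MAIT's implementation: the function name
findBiotransformations, the toy peak table and the two-row
biotransformation table are hypothetical, and the mass differences
(hydroxylation, +15.9949 Da; methylation, +14.0157 Da) are standard
monoisotopic values chosen only for illustration. The sketch assumes
singly charged peaks that have already been assigned to peak groups.
Within each group, every pair of peaks is compared, and a pair is
reported when its mass difference matches a table entry within the mass
allowance window.

```r
# Hypothetical sketch: search for biotransformation mass differences
# between peak pairs that belong to the same peak group.
findBiotransformations <- function(peaks, bioTable, tol = 0.005) {
  hits <- list()
  for (g in unique(peaks$group)) {
    grp <- peaks[peaks$group == g, ]
    if (nrow(grp) < 2) next                 # a pair needs two peaks
    pairs <- combn(nrow(grp), 2)            # all peak pairs in the group
    for (k in seq_len(ncol(pairs))) {
      i <- pairs[1, k]
      j <- pairs[2, k]
      delta <- abs(grp$mz[i] - grp$mz[j])   # observed mass difference
      # Table entries lying inside the mass allowance window
      idx <- which(abs(bioTable$massDiff - delta) <= tol)
      for (m in idx) {
        hits[[length(hits) + 1]] <- data.frame(
          group = g,
          mzLow = min(grp$mz[i], grp$mz[j]),
          mzHigh = max(grp$mz[i], grp$mz[j]),
          biotransformation = bioTable$name[m],
          stringsAsFactors = FALSE
        )
      }
    }
  }
  if (length(hits) == 0) return(data.frame())
  do.call(rbind, hits)
}

# Toy input (assumed values, not taken from the faahKO data)
peaks <- data.frame(
  mz = c(180.0634, 196.0583, 250.1000, 264.1157, 300.2000),
  group = c(1, 1, 2, 2, 2)
)
bioTable <- data.frame(
  name = c("hydroxylation", "methylation"),
  massDiff = c(15.9949, 14.0157),
  stringsAsFactors = FALSE
)

result <- findBiotransformations(peaks, bioTable, tol = 0.005)
print(result)
```

Restricting the comparison to pairs inside a peak group keeps the search
small and avoids matching unrelated peaks that happen to differ by a
tabulated mass.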

Fig. 1. Correspondence between MAIT functions (centre column), generated
output files (left column) and their functionality (right column)

The objective of analysing the metabolomic profiling data is to obtain
the statistically significant features (SSF) that contain the highest amount
of class-related information. To gather these features, MAIT can apply
statistical tests such as ANOVA or Student's t-test to every feature,
selecting the significant set of features given a threshold P-value.
A validation test is included to quantify SSF class separability by a
repeated random subsampling cross-validation using three methods: partial
least squares and discriminant analysis, support vector machines and
K-nearest neighbours (Hastie et al., 2009). MAIT computes overall and
class-related classification ratios to evaluate the SSF class-related
information.
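
The per-feature testing described above can be sketched in base R. This
is an illustration under stated assumptions rather than MAIT's code: the
intensity matrix is simulated, the class labels are invented, and
selectSignificant is a hypothetical helper. It applies Welch's t-test
(the default behaviour of t.test for two samples) to each feature and
keeps the features whose P-value does not exceed the threshold.

```r
set.seed(1)

# Simulated data (assumption): 50 features in rows, 12 samples in columns
nFeatures <- 50
classes <- factor(rep(c("KO", "WT"), each = 6))
intens <- matrix(rnorm(nFeatures * 12, mean = 100, sd = 5),
                 nrow = nFeatures)
# Make the first five features differ clearly between the two classes
intens[1:5, classes == "WT"] <- intens[1:5, classes == "WT"] + 40

# Hypothetical helper: one two-sample test per feature, then thresholding
selectSignificant <- function(x, classes, threshold = 0.05) {
  lv <- levels(classes)
  pvals <- apply(x, 1, function(row) {
    t.test(row[classes == lv[1]], row[classes == lv[2]])$p.value
  })
  list(pvalues = pvals, index = which(pvals <= threshold))
}

ssf <- selectSignificant(intens, classes, threshold = 0.05)
print(ssf$index)
```

With many features tested at once, some unshifted features may also pass
the threshold by chance; the sketch applies no multiple-testing
correction.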



3 RESULTS 

The example data files are a subset of the data used by Saghatelian
et al. (2004), which are distributed freely through the faahKO package
(Smith, 2012). MAIT was used to read and analyse these samples using
the functions depicted in Figure 1 (see the tutorial in the
Supplementary Information). The significant features for each class are
found using statistical tests and analysed through the different plots
that MAIT produces. Using the following function call, 2640 peaks were
detected:

R> MAIT <- sampleProcessing(dataDir = "Dataxcms",
     project = "MAIT_Demo", snThres = 2, rtStep = 0.03)

At this point, the first annotation stage is launched: 

R> MAIT <- peakAnnotation(MAIT.object = MAIT)

Next, we gather the significant features from the detected peaks.
After Welch's tests, 106 of these features were found to be
significant through the spectralSigFeatures function. Statistical
plots such as heat maps, boxplots and principal component analysis
score plots can be generated (Supplementary Figs S3 and S4).
Significant features are then annotated after checking for certain
neutral losses (biotransformations).

R> MAIT <- spectralSigFeatures(MAIT, P = 0.05)



R> MAIT <- Biotransformations(MAIT, peakPrecision = 0.005)

Using only the SSF, a validation stage is launched, obtaining a
classification ratio of 100% with three training samples for all
classifiers. These results suggest that the significant variables
separate both classes completely.

R> MAIT <- Validation(MAIT, Iterations = 20, trainSamples = 3)
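
To make the validation idea concrete, the following base-R sketch runs a
repeated random subsampling cross-validation. It is only an illustration:
the data are simulated, subsampleValidation and predict1nn are
hypothetical names, and a hand-written 1-nearest-neighbour classifier
stands in for the three classifiers that MAIT uses. The arguments
iterations and trainSamples mirror the roles of the parameters in the
call above, but the function is not part of MAIT.

```r
set.seed(42)

# Simulated SSF table (assumption): 12 samples in rows, 4 features in columns
classes <- factor(rep(c("KO", "WT"), each = 6))
x <- matrix(rnorm(12 * 4, mean = 100, sd = 5), nrow = 12)
x[classes == "WT", ] <- x[classes == "WT", ] + 60   # well-separated classes

# 1-nearest-neighbour prediction with Euclidean distance
predict1nn <- function(train, trainLabels, test) {
  apply(test, 1, function(s) {
    d <- sqrt(rowSums(sweep(train, 2, s)^2))
    as.character(trainLabels[which.min(d)])
  })
}

# Hypothetical helper: draw trainSamples per class, test on the rest, repeat
subsampleValidation <- function(x, classes, iterations = 20,
                                trainSamples = 3) {
  ratios <- numeric(iterations)
  for (it in seq_len(iterations)) {
    trainIdx <- unlist(lapply(levels(classes), function(lv) {
      sample(which(classes == lv), trainSamples)
    }))
    testIdx <- setdiff(seq_along(classes), trainIdx)
    pred <- predict1nn(x[trainIdx, , drop = FALSE], classes[trainIdx],
                       x[testIdx, , drop = FALSE])
    # Fraction of held-out samples assigned to the correct class
    ratios[it] <- mean(pred == as.character(classes[testIdx]))
  }
  ratios
}

ratios <- subsampleValidation(x, classes, iterations = 20, trainSamples = 3)
overall <- mean(ratios)   # overall classification ratio across iterations
print(overall)
```

Sampling the training set separately within each class keeps both
classes represented in every iteration, which matters when only a few
samples per class are available.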

Finally, the database is mined to identify the significant 
features. 

R> MAIT <- identifyMetabolites(MAIT, peakTolerance = 0.005)
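
The database search can be pictured as a tolerance-window lookup. The
base-R sketch below is a simplified assumption, not MAIT's
identification code: matchDatabase is a hypothetical function, the
three-row table stands in for the Human Metabolome Database, and the
query values are treated as neutral monoisotopic masses that have
already been derived from the annotated peaks. The "Unknown" label for
unmatched masses is a choice made for this sketch.

```r
# Tiny stand-in for a metabolite database (monoisotopic masses in Da)
db <- data.frame(
  name = c("D-Glucose", "Citric acid", "L-Tryptophan"),
  mass = c(180.0634, 192.0270, 204.0899),
  stringsAsFactors = FALSE
)

# Hypothetical helper: report every database entry within +/- tol of a query
matchDatabase <- function(queryMass, db, tol = 0.005) {
  rows <- lapply(seq_along(queryMass), function(i) {
    hit <- db[abs(db$mass - queryMass[i]) <= tol, , drop = FALSE]
    if (nrow(hit) == 0) {
      data.frame(query = queryMass[i], name = "Unknown",
                 mass = NA_real_, stringsAsFactors = FALSE)
    } else {
      data.frame(query = queryMass[i], name = hit$name,
                 mass = hit$mass, stringsAsFactors = FALSE)
    }
  })
  do.call(rbind, rows)
}

# Toy query masses (assumed values)
ids <- matchDatabase(c(180.0650, 192.0400, 204.0880), db, tol = 0.005)
print(ids)
```

A wider tolerance returns more candidate identities per mass, so the
window is a trade-off between missed matches and ambiguous ones.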



4 CONCLUSIONS 

MAIT provides a set of tools and functions to perform an automatic
end-to-end analysis of LC/MS metabolomic data, putting special emphasis
on peak annotation and metabolite identification. In addition, MAIT
validation functions make it possible to estimate predictive power for
significant variables.